# Coding Theory lecture: entropy and mutual information


Department of Mathematics, Mahidol University
Kit Tyabandha, PhD

## Entropy and mutual information

From 28 Oct 05, as of 4th November 2005

Definition 1. A probability space is a triple $(S, B, P)$ on the domain $S$, which is a nonempty set called the sample space, where $(S, B)$ is a measurable space, $B$ is a Borel field of subsets of $S$, and $P$ is a measure on $S$ with the property that $P(S) = 1$ and, for all disjoint $E_i \in B$,

$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$$

In other words, $P$ is a nonnegative function defined for all events $E_i \in B$, the measurable subsets of $S$. Further, a random variable $X$ is a function mapping $S$ into some set $R$, called the range of $X$. For convenience we shall use $X$ to represent both the function and its range; that is, $X$ is a function which maps $S$ into $X$. If $S$ is discrete and $f$ is some real-valued function defined on $S$, then $X$ and $f(X)$ are two different random variables, and the expectation of the latter is given by

$$E[f(X)] = \sum_x p(x)\, f(x)$$

Definition 2. Let $p(x)$ be the probability that $x \in X$ occurs, and similarly $p(y)$ that $y \in Y$ does, while $p(x,y)$ is the probability that both $x \in X$ and $y \in Y$ occur. Then,

$$p(x|y) = \frac{p(x,y)}{p(y)} \tag{1}$$

$$p(y|x) = \frac{p(x,y)}{p(x)} \tag{2}$$

Definition 3. A Markov chain is a set of random variables $X_t$, where $t = 0, 1, \ldots$, such that

$$P(X_t = j \mid X_0 = i_0, \ldots, X_{t-1} = i_{t-1}) = P(X_t = j \mid X_{t-1} = i_{t-1})$$

In other words, given the present state, the next state is conditionally independent of the past. §

Definition 4. A subset $K \subseteq E^n$, where $E^n$ is the Euclidean space of $n$ dimensions, is called convex if the line segment joining any two points in $K$ is contained in $K$. Let the two points be $x_1$ and $x_2$; then the line segment joining them is $x = t x_1 + (1-t) x_2$, where $0 \le t \le 1$. §

Definition 5. A point $x$ is said to be a convex combination of points $x_1, \ldots, x_m$ if there exist nonnegative scalars $a_i$ such that $\sum_i a_i = 1$ and $\sum_i a_i x_i = x$. The set of all convex combinations of $x_i$, $i = 1, \ldots, m$, is called the convex hull of $\{x_i\}$. §

Definition 6.
Let $f$ be a real-valued function, and let $K$ be a convex subset of the domain of $f$. Then $f$ is said to be convex cup if, for every $x_1, x_2 \in K$ and $0 \le t \le 1$,

$$f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \tag{3}$$

It is said to be strictly convex cup if strict inequality holds in Equation 3 whenever $x_1 \ne x_2$. Similarly, $f$ is said to be convex cap if

$$f(t x_1 + (1-t) x_2) \ge t f(x_1) + (1-t) f(x_2) \tag{4}$$

that is to say, if $-f$ is convex cup. It is said to be strictly convex cap if strict inequality holds in Equation 4 whenever $x_1 \ne x_2$. Convex cap is also known as concave. Geometrically speaking, $f$ is convex cup if and only if all its chords lie above or on the graph of $f$, and $f$ is concave if and only if all its chords lie below or on the graph of the same. §

Definition 7. Let $K$ be some interval in $E^1$, and let $F(x)$ be a probability distribution concentrated on $K$ such that $P(X \le x) = F(x)$. Then, if the expectation $E(X)$ exists and $f(x)$ is a convex cup function,

$$E(f(X)) \ge f(E(X)) \tag{5}$$

If $f$ is strictly convex cup and $X$ is not concentrated at a single point, then strict inequality holds in Equation 5. Similarly, if $f$ is convex cap, then

$$E(f(X)) \le f(E(X)) \tag{6}$$

If $f$ is strictly convex cap and $X$ is not concentrated at a single point, then strict inequality holds in Equation 6. §

Example 1. Suppose that in Definition 7 there is a mass distribution placed on the graph of $f$. Then Equation 5 says that the overall centre of mass will lie above or on the graph, while Equation 6 says that it will lie below or on it.

Entropy is a measure of the uncertainty of many events as a single value. We derive it from Axioms 1 and 2.

Axiom 1. If the events are all equally likely, then the uncertainty function $H(\frac{1}{m}, \ldots, \frac{1}{m})$ is monotonically increasing with $m$. §

Axiom 2.
If $\{E'_1, \ldots, E'_m\}$ and $\{E''_1, \ldots, E''_n\}$ are statistically independent sets of equally likely disjoint events, then the uncertainty of the set of events $\{E'_i \cap E''_j \mid i = 1, \ldots, m;\ j = 1, \ldots, n\}$ is the sum of the uncertainties of the two sets. That is to say, $h(mn) = h(m) + h(n)$, where $h(m) = H(\frac{1}{m}, \ldots, \frac{1}{m})$. §

Definition 8. Let the set of $m$ possible disjoint events be $E = \{E_1, \ldots, E_m\}$. We call $p(E_i)$ an a priori probability of $E_i$, where $1 \le i \le m$ and $\sum_i p(E_i) = 1$. The uncertainty function, or entropy function, $H(p(E_1), \ldots, p(E_m))$ obeys Axioms 1 and 2. §

The entropy of a random variable $x$ gives a measure of the amount of information obtained from an observation of $x$. It also represents the randomness of $x$ and our uncertainty about $x$. The less probable an event is, the more information we receive when it occurs.

Theorem 1. The entropy of a set of $m$ equally likely events is $h(m) = A \log_c m$, where $A$ is a positive constant and $c > 1$.

Proof. Proving Theorem 1 amounts to proving that Axioms 1 and 2 are satisfied if and only if $h(m) = A \log_c m$. The two axioms say that $h(m)$ is monotonically increasing in $m$ and that

$$h(mn) = h(m) + h(n) \tag{7}$$

According to Equation 7, if $m = n = 1$, then $h(1) = h(1) + h(1)$, which implies that $h(1) = 0$. From this together with both axioms above, $h(m) = A \log_c m$ is sufficient as a solution. Next we must prove that this solution is necessarily the only one. Let $a$, $b$ and $c$ be positive integers with $a, b, c > 1$. Then there exists a unique integer $d$ such that

$$c^d \le a^b < c^{d+1} \tag{8}$$

From Equation 8 it follows that $d \log c \le b \log a < (d+1) \log c$, and therefore

$$\frac{d}{b} \le \frac{\log a}{\log c} < \frac{d+1}{b}$$

Since $h(m)$ is monotonically increasing, from Equation 8 we have

$$h(c^d) \le h(a^b) \le h(c^{d+1}) \tag{9}$$

Then from Equation 7, $d\,h(c) \le b\,h(a) \le (d+1)\,h(c)$.
Dividing through by $b\,h(c)$, which is positive,

$$\frac{d}{b} \le \frac{h(a)}{h(c)} \le \frac{d+1}{b} \tag{10}$$

From Equations 9 and 10 it follows that

$$\left|\frac{\log a}{\log c} - \frac{h(a)}{h(c)}\right| \le \frac{1}{b}$$

And, since $b$ is an arbitrary positive integer,

$$\frac{h(a)}{h(c)} = \frac{\log a}{\log c}$$

Since $a$ and $c$ are arbitrary,

$$\frac{h(a)}{\log a} = \frac{h(c)}{\log c} = A$$

Therefore $h(m) = A \log_c m$ is necessarily the only solution. †

Axiom 3. The total uncertainty of events does not depend on the method of indication. §

Axiom 4. The uncertainty measure is a continuous function of the probabilities within it. §

Example 2. Let a set $E$ of $m$ disjoint events be $\{E_1, \ldots, E_m\}$. Let $j_i$, $i = 0, \ldots, n$, be integers with $0 = j_0 < j_1 < j_2 < \cdots < j_n = m$, and let $E$ be divided into $n$ sets of events, namely

$$G_1 = \{E_1, \ldots, E_{j_1}\}, \quad G_2 = \{E_{j_1+1}, \ldots, E_{j_2}\}, \quad \ldots, \quad G_n = \{E_{j_{n-1}+1}, \ldots, E_m\}$$

If we indicate firstly the group, and then the event within that group, then the uncertainty becomes

$$H(p(E_1), \ldots, p(E_m)) = H(p(G_1), \ldots, p(G_n)) + \sum_{i=1}^{n} p(G_i)\, H(p(E_{j_{i-1}+1}|G_i), \ldots, p(E_{j_i}|G_i)) \tag{11}$$

The grouping axiom, Axiom 3, lets us express the uncertainty when all the event probabilities are rational. By grouping equally likely events together and then considering each of the groups as a single event, it gives us the ability to deal with events which are not equally likely. Example 3 shows how this is done. Axiom 4 then extends Axiom 3 to cover irrational probabilities as well, and Equation 12 is the result.

Example 3. As in Example 2, let a set of disjoint events be $E = \{E_1, \ldots, E_m\}$, and let $p(E_i) = \frac{1}{m}$, $i = 1, \ldots, m$. Also, let the groups of events $G_1, \ldots, G_n$ be defined in the same way as therein. Let $n_k$ be the number of events in $G_k$. Then $n_k = j_k - j_{k-1}$ and $p(G_k) = \frac{n_k}{m}$ for $k = 1, \ldots, n$, and also $p(E_i|G_k) = \frac{1}{n_k}$ for $j_{k-1} < i \le j_k$.
Then Equation 11 yields

$$h(m) = H(p(G_1), \ldots, p(G_n)) + \sum_{i=1}^{n} p(G_i)\, h(n_i)$$

And since from Theorem 1, $h(m) = A \log_c m$, we have

$$H(p(G_1), \ldots, p(G_n)) = -\sum_{i=1}^{n} p(G_i)\,\big(h(n_i) - h(m)\big) = -\sum_{i=1}^{n} p(G_i)\left(A \log \frac{n_i}{m}\right) = -A \sum_{i=1}^{n} p(G_i) \log p(G_i) \tag{12}$$

Example 4. From $h(m) = A \log_c m$, if we let $A = \log_b c$, then $h(m) = \log_b m$. In other words, the scale factor $A$ can be absorbed into the base of the logarithm.

Theorem 2. Let $\{p_1, \ldots, p_m\}$ be a set of probabilities such that $\sum_{i=1}^{m} p_i = 1$. Then,

$$H(p_1, \ldots, p_m) = -\sum_{i=1}^{m} p_i \log p_i \tag{13}$$

Proof. This results from Examples 2 and 3, and the scale factor $A$ disappears in a manner similar to that shown in Example 4. †

Example 5. If the base of the logarithm in Equation 13 is 2, the unit of the entropy is the bit. On the other hand, if this base is $e$, that is to say, if we use natural logarithms, then the uncertainty has the unit of the nat. From this one may see that one nat is equal to $\log_2 e$ bits, which is approximately 1.443 bits. The term bit comes from binary digit, the term nat from natural digit.

Definition 9 explains what is meant by conditional entropy. Starting from Equation 14, which gives the conditional entropy when $y$ is given, we obtain the overall conditional entropy in Theorem 3. For any pair of sets $X$ and $Y$, $H(X|Y)$ gives the amount of uncertainty remaining about $X$ after $Y$ has been observed.

Definition 9. The conditional entropy of $X$, given some $y \in Y$, is

$$H(X|y) = -\sum_x p(x|y) \log p(x|y) \tag{14}$$

Then the conditional entropy $H(X|Y)$ is the expectation, or average value, of $H(X|y)$ over the range $Y$. In other words,

$$H(X|Y) = \sum_y p(y)\, H(X|y) \tag{15}$$

Theorem 3. The conditional entropy is

$$H(X|Y) = -\sum_{x,y} p(x,y) \log p(x|y)$$

(Sometimes the entropy function is defined instead by $H(p_1, \ldots, p_m) = \sum_{i=1}^{m} p_i \log \frac{1}{p_i}$, but this is obviously the same as our Equation 13, since $\log x^{-1} = -\log x$.)
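The quantities defined so far are easy to check numerically. The following is a minimal Python sketch of Equation 13, of the formula in Theorem 3, and of the bit/nat conversion of Example 5; the joint distribution and the helper names `entropy_bits` and `cond_entropy_bits` are invented for illustration, not taken from the text.

```python
import math

# An illustrative joint distribution p(x, y) over X = {0, 1}, Y = {0, 1}.
joint = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.375, (1, 1): 0.125}

def entropy_bits(probs):
    """Equation 13 with base-2 logarithm: H = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cond_entropy_bits(joint):
    """Theorem 3: H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), in bits."""
    py = {}
    for (x, y), p in joint.items():
        py[y] = py.get(y, 0.0) + p          # marginal p(y)
    return -sum(p * math.log2(p / py[y])    # p(x|y) = p(x,y)/p(y), Equation 1
                for (x, y), p in joint.items() if p > 0)

h_bits = entropy_bits([0.5, 0.5])      # one fair binary choice: 1 bit
h_nats = h_bits / math.log2(math.e)    # one nat = log2(e) bits, Example 5
```

Here `h_nats` comes out as $\ln 2 \approx 0.693$, confirming that a fair bit is worth less than one nat.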
Proof. Putting the equation of conditional entropy when $y$ is given, Equation 14, into the overall conditional entropy equation, Equation 15, we get

$$H(X|Y) = \sum_y p(y)\, H(X|y) = -\sum_y p(y) \sum_x p(x|y) \log p(x|y)$$

Then from Equation 1 of Definition 2, $p(y)\,p(x|y) = p(x,y)$, and so

$$H(X|Y) = -\sum_{x,y} p(x,y) \log p(x|y)$$

†

Theorem 4. Let $X$, $Y$ and $Z$ be discrete random variables. For each $z \in Z$, let $E(z) = \sum_{x,y} p(y)\, p(z|x,y)$. Then,

$$H(X|Y) \le H(Z) + E[\log E(Z)]$$

Proof.

$$H(X|Y) = -E[\log p(x|y)] = -\sum_{x,y,z} p(x,y,z) \log p(x|y) = \sum_z p(z) \sum_{x,y} \frac{p(x,y,z)}{p(z)} \log \frac{1}{p(x|y)}$$

Because $\frac{p(x,y,z)}{p(z)} = p(x,y|z)$ is a probability distribution, and the logarithm is a convex cap function, we may apply Equation 6, namely Jensen's inequality for convex cap, from Definition 7. Hence,

$$H(X|Y) \le \sum_z p(z) \log \sum_{x,y} \frac{p(x,y,z)}{p(z)\, p(x|y)} = \sum_z p(z) \log \frac{1}{p(z)} + \sum_z p(z) \log \sum_{x,y} \frac{p(x,y,z)}{p(x|y)}$$

But,

$$\frac{p(x,y,z)}{p(x|y)} = \frac{p(x,y,z)\, p(y)}{p(x,y)} = p(y)\, p(z|x,y)$$

so that $\sum_{x,y} p(x,y,z)/p(x|y) = E(z)$, and the right-hand side above is $H(Z) + E[\log E(Z)]$; hence the statement is proved. †

Corollary 4.1. Let $X$ and $Y$ be random variables, each of which takes values in the set $\{x_1, \ldots, x_r\}$. Let $P_e = P(X \ne Y)$. Then,

$$H(X|Y) \le H(P_e, 1 - P_e) + P_e \log(r-1)$$

Proof. In Theorem 4, let $Z = 0$ if $X = Y$, and let $Z = 1$ if $X \ne Y$. Then $E(0) = 1$ and $E(1) = r - 1$, while $H(Z) = H(P_e, 1 - P_e)$ and $E[\log E(Z)] = P_e \log(r-1)$. †

Theorem 5. The maximum uncertainty occurs when the events are equiprobable.

Proof. Since

$$H\left(\frac{1}{m}, \ldots, \frac{1}{m}\right) - H(p_1, \ldots, p_m) = \log_b m + \sum_{i=1}^{m} p_i \log_b p_i = \log_b e \sum_{i=1}^{m} p_i \ln(m p_i) \ge \log_b e \sum_{i=1}^{m} p_i \left(1 - \frac{1}{m p_i}\right) = 0$$

it being the case that $\ln x \ge 1 - \frac{1}{x}$, that is, $\ln \frac{1}{x} \ge 1 - x$. Therefore $H(p_1, \ldots, p_m)$ is maximised when $p_i = \frac{1}{m}$ for all $i = 1, \ldots, m$. †

Example 6. Figure 1 shows that $\ln x \le x - 1$, while Figure 2 shows that no such inequality exists when the logarithm in question is of base 10.

Figure 1. Plots of $\ln x$ and $x - 1$, which show that $\ln x \le x - 1$.
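The inequalities illustrated by Figures 1 and 2 can also be checked numerically. A short Python sketch, with arbitrary sample points chosen for illustration:

```python
import math

# ln x <= x - 1 holds for all x > 0 (Figure 1), and equivalently
# ln(1/x) >= 1 - x (Figure 3).  The same bound fails for the base-10
# logarithm (Figure 2): the sample points below include x = 0.5, where
# log10(0.5) = -0.301 > -0.5 = x - 1.
samples = [0.1, 0.5, 1.0, 2.0, 10.0]

ln_bounded = all(math.log(x) <= x - 1 for x in samples)
recip_bounded = all(math.log(1 / x) >= 1 - x for x in samples)
log10_bounded = all(math.log10(x) <= x - 1 for x in samples)
```

Here `ln_bounded` and `recip_bounded` are `True`, while `log10_bounded` is `False`, in agreement with the figures.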
Figure 2. Graphs of $y = \log_{10} x$ and $y = x - 1$, which show that the latter is no bound for the values of the former.

Example 7. Figure 3 confirms for us that $\ln \frac{1}{x} \ge 1 - x$, whereas Figure 4 tells us that this is not the case for $\log_{10} \frac{1}{x}$.

Figure 3. Plots of $\ln \frac{1}{x}$ and $1 - x$, which show that $\ln \frac{1}{x} \ge 1 - x$.

Figure 4. Graphs of $y = \log_{10} \frac{1}{x}$ and $y = 1 - x$, from which it is clear that the latter gives no bound for the former.

Example 8. Consider two events with probabilities $p$ and $1 - p$. The entropy function is then

$$H(p, 1-p) = -p \log p - (1-p) \log(1-p)$$

Whenever the occurrence of either event becomes a certainty, the entropy function becomes zero. Mathematically we see that $\lim_{p \to 0} p \log p = 0$ and likewise $\lim_{p \to 1} (1-p) \log(1-p) = 0$. Figure 5 shows a plot of the values of the entropy function for two events; the base-2 logarithm is used here.

Figure 5. The entropy function of two events with probabilities $p$ and $1-p$: $H = -p \log_2 p - (1-p) \log_2(1-p)$.

Definition 10. The mutual information is $I(X;Y) = H(X) - H(X|Y)$. It represents the information provided about $X$ by $Y$. §

Example 9. Alternatively, the mutual information may take the following forms, cf Definition 2,

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x|y)}{p(x)} = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = \sum_{x,y} p(x,y) \log \frac{p(y|x)}{p(y)}$$

That is to say, $I(X;Y)$ is the average, taken over the $X, Y$ sample space, of the random variable $I(x;y)$ such that

$$I(x;y) = \log \frac{p(x|y)}{p(x)} = \log \frac{p(x,y)}{p(x)\,p(y)} = \log \frac{p(y|x)}{p(y)}$$

Theorem 6. For any discrete random variables $X$ and $Y$, $I(X;Y) \ge 0$. Moreover, $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent.

Proof.
From one of our formulae for the mutual information and from Jensen's inequality,

$$I(X;Y) = -\sum_{x,y} p(x,y) \log \frac{p(x)\,p(y)}{p(x,y)} \ge -\log \sum_{x,y} p(x)\,p(y) = -\log 1 = 0$$

Furthermore, the equality sign holds if and only if $p(x)\,p(y) = p(x,y)$ for all $x$ and $y$, that is to say, when $X$ and $Y$ are independent of each other. †

Example 10. From our formulae for the mutual information we may see that

$$I(X;Y) = I(Y;X) \quad \text{and} \quad I(X;Y) = H(Y) - H(Y|X)$$

Also,

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

Definition 11. Let $X$, $Y$ and $Z$ be three random variables. Then the mutual information $I(X,Y;Z)$ is given by

$$I(X,Y;Z) = \sum_{x,y,z} p(x,y,z) \log \frac{p(z|x,y)}{p(z)}$$

This mutual information is the amount of information $X$ and $Y$ provide about $Z$. §

Theorem 7. Let $X$, $Y$ and $Z$ be three random variables. Then we have $I(X,Y;Z) \ge I(Y;Z)$, where the equality holds if and only if $p(z|x,y) = p(z|y)$ for all $(x,y,z)$ such that $p(x,y,z) > 0$.

Proof.

$$I(Y;Z) - I(X,Y;Z) = \sum_{x,y,z} p(x,y,z) \left(\log \frac{p(z|y)}{p(z)} - \log \frac{p(z|x,y)}{p(z)}\right) = \sum_{x,y,z} p(x,y,z) \log \frac{p(z|y)}{p(z|x,y)}$$

Then using Jensen's inequality, we have

$$I(Y;Z) - I(X,Y;Z) \le \log \sum_{x,y,z} p(x,y,z)\, \frac{p(z|y)}{p(z|x,y)} = \log \sum_{x,y,z} p(x,y)\,p(z|y) = \log 1 = 0$$

†

Theorem 8. Let $(X, Y, Z)$ be a Markov chain. Then $I(X;Z) \le I(Y;Z)$ and $I(X;Z) \le I(X;Y)$.

Proof. From Theorem 7, $I(X;Z) \le I(X,Y;Z)$. Because $(X, Y, Z)$ is a Markov chain, $p(z|x,y) = p(z|y)$, so by the equality condition of Theorem 7, $I(X,Y;Z) = I(Y;Z)$. Therefore $I(X;Z) \le I(Y;Z)$. Next, since $(X, Y, Z)$ is a Markov chain, $(Z, Y, X)$ is also a Markov chain. Hence, by the same argument, $I(X;Z) \le I(X;Y)$. †
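Theorems 6 and 8 can be verified numerically on a small Markov chain $X \to Y \to Z$. The following is a minimal Python sketch; the input distribution, the two channels, and the helper names `mi` and `pair` are all invented for illustration.

```python
import math

# An illustrative Markov chain X -> Y -> Z: an input distribution p(x)
# followed by two binary channels.  The Markov property is built in via
# p(x, y, z) = p(x) p(y|x) p(z|y).
px = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.25, 1: 0.75}}

joint3 = {(x, y, z): px[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
          for x in px for y in (0, 1) for z in (0, 1)}

def mi(pairs):
    """I between the two coordinates of a joint given as {(a, b): p} (Example 9)."""
    pa, pb = {}, {}
    for (a, b), p in pairs.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in pairs.items() if p > 0)

def pair(joint3, i, j):
    """Marginalise the triple joint onto coordinates i and j."""
    out = {}
    for key, p in joint3.items():
        k = (key[i], key[j])
        out[k] = out.get(k, 0.0) + p
    return out

i_xy = mi(pair(joint3, 0, 1))
i_yz = mi(pair(joint3, 1, 2))
i_xz = mi(pair(joint3, 0, 2))
```

By Theorem 6 all three values are nonnegative, and by Theorem 8 the value `i_xz` is bounded by both `i_xy` and `i_yz`: processing $Y$ through the second channel cannot create information about $X$.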